sceptre 4: Comparing SCEPTRE and Seurat DE across datasets

This is the fourth installment of the debugging SCEPTRE series. I aim to answer the following questions.
I begin by comparing the SCEPTRE results obtained on the Frangieh data to those obtained on the Papalexi data. I downsample the set of Frangieh pairs so that the Frangieh data contain the same number of pairs as the Papalexi data. SCEPTRE clearly exhibits better calibration on the Papalexi data than on the Frangieh data.
Next, for each dataset, I plot the pairs that pass QC alongside the pairs that do not. The comparison here is along two axes: first, across datasets, and second, within a given dataset, across pairs that passed QC versus pairs that did not. To enable meaningful comparisons along both axes, I performed downsampling so that each dataset-QC status (i.e., passed versus not passed) tuple would contain the same number of pairs. In other words, I performed downsampling so that each of the eight sets of pairs below (two for each of the four plots) would have the same number of pairs.
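The downsampling step can be sketched in a few lines. This is a minimal Python sketch, not the actual (presumably R) implementation; all function and variable names are hypothetical.

```python
import random

def downsample_to_common_size(pairs_by_group, seed=1):
    """Downsample each group of pairs to the size of the smallest group.

    `pairs_by_group` maps a (dataset, QC status) tuple to a list of
    pair identifiers. All names and sizes here are made up.
    """
    rng = random.Random(seed)
    n_min = min(len(pairs) for pairs in pairs_by_group.values())
    return {group: rng.sample(pairs, n_min)
            for group, pairs in pairs_by_group.items()}

# toy example with four (dataset, QC status) groups of unequal size
groups = {
    ("frangieh", "pass_qc"): [f"pair_{i}" for i in range(500)],
    ("frangieh", "fail_qc"): [f"pair_{i}" for i in range(120)],
    ("papalexi", "pass_qc"): [f"pair_{i}" for i in range(300)],
    ("papalexi", "fail_qc"): [f"pair_{i}" for i in range(80)],
}
balanced = downsample_to_common_size(groups)
# every group in `balanced` now contains 80 pairs
```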
We see that QC makes an enormous difference on the Frangieh data: the pairs that survive QC are fairly well calibrated, while those that are filtered out are poorly calibrated. By contrast, QC does not make much of a difference on the Papalexi data: the pairs that pass QC and those that do not are both fairly well calibrated (although there does seem to be a small difference).
Next, I make a similar figure, this time plotting pairs with a “good” skew-normal fit alongside those with a “bad” (defined as an “adequate,” “subpar,” or “poor”) skew-normal fit. Again, I downsample so that each dataset-skew-normal fit status (i.e., “good” versus “bad”) tuple contains the same number of pairs. Note that the number of pairs plotted below is much smaller than the number of pairs plotted above; thus, we are not looking as far out into the tail of the distribution. Still, the trend is clear: SCEPTRE exhibits similar performance on the Papalexi data regardless of whether the skew-normal fit is good or bad. By contrast, on the Frangieh data, a good skew-normal fit is required for p-value calibration.
These two sets of results are concordant: on the Frangieh and Papalexi datasets, almost all of the variance in skew-normal fit quality is due to the effective sample size. Thus, pairs that pass QC tend to exhibit a good skew-normal fit, and conversely, pairs that are filtered tend to exhibit a bad skew-normal fit. In conclusion, a large effective sample size (or, roughly equivalently, a good skew-normal fit) is required for SCEPTRE to exhibit good calibration on the Frangieh data but not on the Papalexi data.
I attempt to answer the following questions. First, why does effective sample size not exert a larger impact on the calibration quality of SCEPTRE on the Papalexi data? Put differently, how can pairs with a low effective sample size (and thus a poor skew-normal fit) exhibit good calibration? Second, why does this phenomenon not hold on the Frangieh data?
First, to get our bearings, I plot some of the SCEPTRE resampling distributions on the Papalexi data that fall into the “good,” “adequate,” “subpar,” and “poor” categories. Here are four pairs in the good category:
Here are four in the adequate category:
Here are four in the subpar category:
Finally, here are eight in the poor category:
For comparison, I plot eight resampling distributions in the poor category from the Frangieh co-culture data.
The Papalexi null distributions are less “lumpy” and multimodal than the Frangieh null distributions. Inflation occurs when there is a “lump” far out into the tail of the distribution, and this seems to be a more common occurrence on the Frangieh data than on the Papalexi data.
Next, I compute \(p_{\textrm{rat}}\), the ratio of the skew-normal p-value to the empirical p-value, on the Frangieh and Papalexi datasets. \(p_{\textrm{rat}}\) quantifies the extent to which a skew-normal p-value is miscalibrated relative to the exact (permutation-based) p-value, with values larger than one indicating inflation. To facilitate comparisons, I downsample the Frangieh datasets so that all datasets contain the same number of pairs.
The Papalexi data do not have many points far above one, whereas the Frangieh data do. Furthermore, the Frangieh data contain many more “poor” pairs than the Papalexi data. Next, I condition on “bad” (i.e., not “good”) pairs and compare across datasets, again downsampling to enable comparisons.
The trend is similar: conditional on the event that a pair exhibits a “bad” skew-normal fit, a Frangieh pair is more likely to exhibit severe inflation than a Papalexi pair. Finally, for all datasets, and for a given value of \(p_{\textrm{rat}}\) (e.g., \(p_{\textrm{rat}} = 2\)), I compute the fraction of “bad” pairs that falls above \(p_{\textrm{rat}}\). The results are plotted below. The line corresponding to the Papalexi data (i.e., the purple line) falls far below the lines corresponding to the Frangieh data. This confirms that, conditional on the event that a pair exhibits a “bad” skew-normal fit, a Frangieh pair is much more likely to exhibit severe inflation than a Papalexi pair.
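The quantities behind these curves are simple to compute. Here is a minimal Python sketch; the \(p_{\textrm{rat}}\) values below are made up for illustration, not taken from either dataset.

```python
def p_rat(skew_normal_p, empirical_p):
    """Ratio of the skew-normal p-value to the empirical p-value.

    Per the definition in the text, values larger than one indicate
    that the skew-normal p-value is inflated relative to the exact one.
    """
    return skew_normal_p / empirical_p

def frac_exceeding(p_rats, thresholds):
    """For each threshold, the fraction of pairs whose p_rat exceeds it.

    Plotting this fraction against the threshold yields one curve per
    dataset; `p_rats` would hold the values for the "bad" pairs of a
    single dataset.
    """
    n = len(p_rats)
    return [sum(r > t for r in p_rats) / n for t in thresholds]

# made-up p_rat values for the "bad" pairs of two datasets
frangieh_bad = [0.8, 1.1, 1.3, 2.5, 4.0, 9.0]
papalexi_bad = [0.7, 0.9, 1.0, 1.1, 1.2, 1.4]
curve_f = frac_exceeding(frangieh_bad, [1, 2, 4])
curve_p = frac_exceeding(papalexi_bad, [1, 2, 4])
# the Frangieh-like curve dominates the Papalexi-like curve
```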
We can answer the two questions posed at the top of this writeup as follows.
Answer: Briefly, using a skew-normal approximation to the null distribution causes more problems on the Frangieh data than on the Papalexi data. We have the following simple model for the probability that a given pair exhibits inflation:
\[\mathbb{P}(\textrm{inflation}) = \mathbb{P}(\textrm{inflation} \mid \textrm{poor skew-normal fit}) \cdot \mathbb{P}(\textrm{poor skew-normal fit}).\] (This decomposition implicitly assumes that pairs with a good skew-normal fit essentially never exhibit inflation.) Both quantities on the right-hand side of the equation are substantially larger on the Frangieh data than they are on the Papalexi data.
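As a toy illustration of how the two factors compound, with probabilities that are made up rather than estimated from either dataset:

```python
def p_inflation(p_inflation_given_poor_fit, p_poor_fit):
    """P(inflation) under the product decomposition above.

    Both arguments below are hypothetical numbers chosen only to
    illustrate the compounding effect.
    """
    return p_inflation_given_poor_fit * p_poor_fit

frangieh_like = p_inflation(0.5, 0.4)  # both factors large
papalexi_like = p_inflation(0.1, 0.2)  # both factors small
# the product amplifies the difference between the two datasets
```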
Answer: Inflation (or lack thereof) is closely tied to skew-normal fit quality on the Frangieh data but not on the Papalexi data. This discrepancy comes down to differences in the shapes of the resampling distributions across datasets, with Frangieh resampling distributions being more likely to exhibit “lumps” far out into the tail.
The Seurat DE results (faceted by dataset and colored by QC status) are below.
I applied Seurat DE to a subset of the Papalexi and Frangieh data, obtaining exact p-values by permuting the data and computing the Mann-Whitney (MW) test statistic \(B = 100{,}000\) times. I assessed the quality of the standard normal approximation to the null distribution by running a KS test on the resampled test statistics. I plot examples of empirical null distributions in the “good,” “adequate,” “subpar,” and “poor” categories.
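The permutation scheme can be sketched as follows. This is a brute-force Python sketch under stated assumptions (a two-sided comparison around the null mean of U, with an add-one correction), not the actual implementation; the text used B = 100,000 permutations rather than the small number here.

```python
import random

def mann_whitney_u(x, y):
    """Brute-force Mann-Whitney U statistic for sample x versus sample y.

    Counts pairs (x_i, y_j) with x_i > y_j, crediting ties 0.5.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

def mw_permutation_p(x, y, n_perm=1000, seed=1):
    """Empirical two-sided p-value for the MW statistic via permutation.

    Shuffles group labels n_perm times and compares each permuted
    statistic's deviation from the null mean of U to the observed
    deviation; the +1 terms keep the p-value away from zero.
    """
    rng = random.Random(seed)
    mu = len(x) * len(y) / 2.0  # null mean of U under exchangeability
    observed = abs(mann_whitney_u(x, y) - mu)
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        u = mann_whitney_u(pooled[:n_x], pooled[n_x:])
        if abs(u - mu) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# toy example: clearly separated groups give a small p-value
p = mw_permutation_p([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0], n_perm=999)
```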
Here are four in the “good” category.
Here are four in the “adequate” category.
Here are four in the “subpar” category.
And here are four in the “poor” category.
Next, I study the relationship between standard normal fit quality and effective sample size. These results perhaps should be taken with a grain of salt, as I sampled non-randomly from the set of Papalexi pairs (focusing on those with very small Mann-Whitney p-values). I plot standard normal fit quality (as quantified by KS statistic) against effective sample size for a set of pairs sampled from the Papalexi dataset.
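The fit-quality metric here is a one-sample KS statistic: the maximum distance between the empirical CDF of the resampled test statistics and the standard normal CDF. A self-contained Python sketch (not the actual implementation):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_statistic(samples):
    """One-sample Kolmogorov-Smirnov statistic against N(0, 1).

    Returns the maximum gap between the empirical CDF of `samples`
    (checked on both sides of each jump) and the standard normal CDF;
    smaller values indicate a better standard normal fit.
    """
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x)
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d
```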
We see that as effective sample size increases, the KS statistic decreases, indicating improved fit quality. Next, I plot \(p_{\textrm{rat}}\) against the KS statistic.
Finally, I plot the empirical p-value against the MW p-value.